06. Forests of Randomized Trees

Random Forests

Random Forests are ensemble prediction algorithms that combine random row selection with random column selection. Each tree in the ensemble is grown as follows (a code sketch follows the list):

  • If the training dataset has N rows, build the dataset for each constituent tree by sampling N rows at random, with replacement, from the original data (a bootstrap sample).

  • If there are M columns in the training dataset, pick a number m << M. At each node, select m columns at random out of the M and use the best of the possible splits on these m columns to split the node. The value of m is held constant while the forest is grown; in scikit-learn this is the max_features parameter, whose default for classification is sqrt(M).

  • Grow each tree to the largest extent possible, with no pruning.
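
To make the recipe above concrete, here is a minimal sketch in Python, assuming scikit-learn and NumPy are available. The synthetic dataset from make_classification and the grow_tree helper are illustrative choices, not part of the original description.

    # Minimal sketch of growing a forest of randomized trees.
    # The dataset and the grow_tree helper are illustrative assumptions.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    def grow_tree(X, y, rng):
        n_rows = X.shape[0]
        # Step 1: bootstrap sample -- draw N row indices with replacement.
        rows = rng.integers(0, n_rows, size=n_rows)
        # Step 2: max_features="sqrt" makes the tree consider a random
        # subset of sqrt(M) columns at each node.
        # Step 3: no max_depth, so the tree is grown as far as possible.
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1 << 31)))
        return tree.fit(X[rows], y[rows])

    rng = np.random.default_rng(0)
    forest = [grow_tree(X, y, rng) for _ in range(100)]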

For a regression model, the forest's prediction is the average of the individual trees' predictions; for a classification model, it is the mode (majority vote) of the trees' predictions.
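
Continuing the sketch above, the per-tree predictions can be combined with a plain NumPy majority vote for classification; a regression forest would average instead. The variable names carry over from the previous sketch and are likewise hypothetical.

    # Stack per-tree predictions: shape (n_trees, n_samples).
    per_tree = np.stack([tree.predict(X) for tree in forest]).astype(int)

    # Classification: the mode (majority vote) across trees for each sample.
    majority_vote = np.apply_along_axis(
        lambda votes: np.bincount(votes).argmax(), axis=0, arr=per_tree)

    # Regression would instead average: per_tree.mean(axis=0).
    print("training accuracy of the ensemble:", (majority_vote == y).mean())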

If you’d like to learn more, check out this paper.
